---
title: "Assignment 7"
output: 
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: default
      navbar-bg: "purple"
    orientation: columns
    vertical_layout: fill
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(DT)
insurance <- read_csv("./insurance.csv")
```


Overview
===

Column {data-width=550}
---

### <b><font size = 4><span Style = "color:black">Data Set Observations</span></font></b>

```{r table}
datatable(insurance, rownames = FALSE, colnames= c("age", "sex", "bmi", "children", "smoker", "region", "charges"), options = list(pageLength = 20))
```

Column {data-width=450}
---

### Overall Analysis
This dashboard explores the personal medical cost data set, which has the variables age, sex, bmi, children, smoker, region, and charges. The southeast region has the most observations, while the northeast, northwest, and southwest regions have roughly equal counts. The body mass index (BMI) distribution looks approximately normal, and the BMI boxplots show outliers in every region except the northwest. The charges distribution is heavily right-skewed. Non-smokers generally have lower charges than smokers, although some non-smokers still have charges in the 20k to 30k range. A fitted straight line gives a reasonable first summary of the relationship between age and charges, though it can miss more complex patterns and outliers. The largest group of people in the data has no children, and the share falls as the number of children rises. In the boxplots of charges by number of children, outliers are most numerous for 0 children and 1 child. Overall, the plots of charges are consistently right-skewed.


Bar Plot
===

Column {data-width=500}
---

```{r Barplot}
ggplot(insurance, aes(x = region, fill = region)) +
  geom_bar() +
  ggtitle("Distribution of Region") +
  xlab("Region") +
  ylab("Count") +
  theme_minimal()
```

Column {data-width=500}
---

### Analysis
The southeast region has the most observations in the medical cost data. The northeast, northwest, and southwest regions have about the same number of observations as each other.
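
The bar heights can also be read off as numbers. The chunk below is a minimal sketch, not part of the original assignment: `level_share()` is a hypothetical helper that tabulates a categorical column into counts and percentages, and the `exists()` guard is only there so the chunk also runs outside this dashboard.

```{r}
# Hypothetical helper: counts and percentages for each level of a categorical vector.
level_share <- function(x) {
  counts <- table(x)                      # NA values are dropped by table()
  data.frame(
    level   = names(counts),
    n       = as.integer(counts),
    percent = round(100 * as.integer(counts) / sum(counts), 1)
  )
}

# Applied to the region column when the data from the setup chunk is loaded.
if (exists("insurance")) level_share(insurance$region)
```

The same helper could be applied to other categorical columns, such as `smoker` or `sex`.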

Stack Bar Plot
===

Column {data-width=1000}
---

```{r Bar Stack}
ggplot(insurance, aes(x = region, fill = smoker)) +
  geom_bar(position = "fill") +
  ggtitle("Smoker Distribution in Each Region") +
  xlab("Region") +
  scale_y_continuous(labels = scales::percent) +  # position = "fill" yields proportions (0 to 1)
  ylab("Percentage") +
  theme_minimal()
```


Histogram bmi
===

Column {data-width=500}
---

```{r Histogram bmi}
ggplot(insurance, aes(x = bmi)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  ggtitle("BMI Distribution") +
  xlab("BMI") +
  ylab("Count") +
  theme_minimal()
```

Column {data-width=500}
---

### Analysis
The BMI histogram is roughly symmetric and bell-shaped, so BMI looks approximately normally distributed.

Histogram Charges
===

Column {data-width=500}
---

```{r Histogram Charges}
ggplot(insurance, aes(x = charges)) +
  geom_histogram(binwidth = 1000, fill = "green", color = "black") +
  ggtitle("Charges Distribution") +
  xlab("Charges") +
  ylab("Count") +
  theme_minimal()
```

Column {data-width=500}
---

### Analysis
The histogram of charges is heavily right-skewed: most charges are low, with a long tail of high charges.
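
A quick numeric check of the skew, offered as a sketch rather than part of the original analysis: in a right-skewed distribution the long upper tail pulls the mean above the median, and the sample skewness is positive. `skew_summary()` is a hypothetical helper introduced here, and the `exists()` guard only lets the chunk run outside this dashboard.

```{r}
# Hypothetical helper: mean, median, and moment-based sample skewness.
skew_summary <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  c(
    mean     = m,
    median   = median(x),
    skewness = mean((x - m)^3) / mean((x - m)^2)^1.5
  )
}

# Applied to the charges column when the data from the setup chunk is loaded.
if (exists("insurance")) skew_summary(insurance$charges)
```

A mean well above the median, together with a positive skewness, would agree with what the histogram shows.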

Boxplot bmi
===

Column {data-width=500}
---

```{r Boxplot bmi}
ggplot(insurance, aes(x = region, y = bmi, fill = region)) +
  geom_boxplot() +
  ggtitle("Distribution of BMI Based on Region") +
  xlab("Region") +
  ylab("BMI") +
  theme_minimal()
```

Column {data-width=500}
---

### Analysis
The northwest region is the only boxplot without an outlier. BMI looks roughly symmetric in all of the regions. The southeast region has a larger interquartile range than the others.
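
The points drawn as outliers follow the usual 1.5 × IQR rule, which is the default in `geom_boxplot()`. The sketch below counts them per region; `count_outliers()` is a hypothetical helper, and the counts could differ slightly from the plot if the quartiles are computed with a different method.

```{r}
# Hypothetical helper: number of values outside the 1.5 * IQR fences.
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  sum(x < q[[1]] - fence | x > q[[2]] + fence, na.rm = TRUE)
}

# Outlier count of BMI per region, when the data from the setup chunk is loaded.
if (exists("insurance")) tapply(insurance$bmi, insurance$region, count_outliers)
```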

Scatterplot
===

Column {.tabset data-width=550}
---

### Scatterplot 1

```{r Scatterplot 1}
ggplot(insurance, aes(x = age, y = charges)) +
  geom_point() +
  ggtitle("Relationship between Age and Charges") +
  xlab("Age") +
  ylab("Charges") +
  theme_minimal()
```

### Scatterplot 2

```{r Scatterplot 2}
ggplot(insurance, aes(x = age, y = charges, color = smoker)) +
  geom_point() +
  ggtitle("Relationship between Age, Charges, and Smoker Status") +
  xlab("Age") +
  ylab("Charges") +
  theme_minimal()
```

Column {data-width=450}
---

### Analysis of Scatterplot 1
In some areas of the plot the points form about three distinct bands. Charges also tend to increase with age.

### Analysis of Scatterplot 2
Overall, non-smokers had lower charges than smokers. In some areas of the graph, however, non-smokers still have charges in the 20k to 30k range.

Scatterplot Extra
===
```{r}
smoker <- insurance[insurance$smoker == "yes", ]
nonsmoker <- insurance[insurance$smoker == "no", ]
```

Column {.tabset data-width=550}
---

### Scatterplot 1

```{r}
ggplot(smoker, aes(x = age, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Relationship between Age and Charges for Smokers") +
  xlab("Age") +
  ylab("Charges") +
  theme_minimal()
```

### Scatterplot 2

```{r}
ggplot(nonsmoker, aes(x = age, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  ggtitle("Relationship between Age and Charges for Non-Smokers") +
  xlab("Age") +
  ylab("Charges") +
  theme_minimal()
```


Column {data-width=450}
---

### Analysis of Scatterplot 1
The fitted line runs between the two clusters of points. I think it makes sense to use the line to summarize the relationship between clients' age and their charges: it captures the underlying upward trend and gives an initial insight into the relationship. However, it does not capture more complex patterns, such as the two separate clusters, or outliers.

### Analysis of Scatterplot 2
In this case the line lies close to most of the points. A minority of points are scattered above the line.
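
How closely each line follows its points can be put into numbers. This is a sketch under assumptions: `fit_age_line()` is a hypothetical helper that refits the same straight line drawn by `geom_smooth(method = "lm")` and reports its intercept, slope, and R-squared, and the `exists()` guard only lets the chunk run outside this dashboard.

```{r}
# Hypothetical helper: fits charges ~ age and reports the line plus R-squared.
fit_age_line <- function(data) {
  fit <- lm(charges ~ age, data = data)
  c(
    intercept = coef(fit)[["(Intercept)"]],
    slope     = coef(fit)[["age"]],
    # For a one-predictor linear fit, R-squared equals the squared correlation.
    r_squared = cor(data$age, data$charges)^2
  )
}

# Compare the two subsets created at the top of this page.
if (exists("smoker") && exists("nonsmoker")) {
  rbind(smokers = fit_age_line(smoker), non_smokers = fit_age_line(nonsmoker))
}
```

A higher R-squared in one row would mean the line accounts for more of the variation in charges for that group.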

### Overall
There are several things you could do to model charges using the other variables in this data. You could identify and handle outliers that might distort the model. You could also measure the effect of categorical variables such as sex or region on charges.
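
As one possible starting point, the sketch below fits a linear model of charges on all the other variables. The choice of predictors and the helper name `fit_charges()` are assumptions made for illustration, not part of the original assignment; `lm()` turns the character columns (sex, smoker, region) into indicator variables automatically.

```{r}
# Hypothetical helper: linear model of charges on every other column.
fit_charges <- function(data) {
  lm(charges ~ age + bmi + children + smoker + sex + region, data = data)
}

# Rounded coefficients, when the data from the setup chunk is loaded.
if (exists("insurance")) round(coef(fit_charges(insurance)), 1)
```

Each coefficient for a categorical variable is the estimated difference in charges from that variable's baseline level, holding the other predictors fixed.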

Pie Chart
===

Column {data-width=500}
---

```{r Pie Chart}
children_count <- count(insurance, children)
children_count$percent <- round(children_count$n / sum(children_count$n) * 100, 2)

pie_chart <- ggplot(children_count, aes(x = "", y = percent, fill = factor(children))) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(children, "\n", percent, "%")),
            fontface = "bold", color = "black",
            position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette = "Oranges") +
  theme_void() +
  theme(text = element_text(size = 20)) +
  labs(fill = "Number of Children")

pie_chart
```

Column {data-width=500}
---

### Analysis
The largest share of people in the data have 0 children. The more children a person has, the smaller that group's percentage.

Boxplot Extra
===

Column {data-width=500}
---

```{r Boxplot Extra}
ggplot(insurance, aes(x = factor(children), y = charges)) +
  geom_boxplot() +
  ggtitle("Distribution of Charges Based on Number of Children") +
  xlab("Number of Children") +
  ylab("Charges") +
  theme_minimal()
```

Column {data-width=500}
---

### Analysis
The 0-children and 1-child groups have many more outliers than the other groups. All of the boxplots also look right-skewed.